HEALTHCARE CLAIMS DATA ANALYSIS 

Cross Reference to Related Application 
This application claims priority from provisional serial no. 60/272,561, filed
March 1, 2001, which is incorporated herein by reference.

Background of the Invention 
A database of healthcare claims data for analysis may contain data from a 
number of different health plans. Such claims are made from medical 
practitioners to insurance carriers for payment. Efforts have been made to
standardize such data, and every data set undergoes a rigorous data quality
validation process.

Two important data elements in the analysis of healthcare expenditures
are 'Charged' (or 'Claimed' or 'Charge') and 'Paid' amounts. "Charged" refers to
what a doctor or other practitioner charges the insurance carrier for a service
provided; "Paid" is what the practitioner is actually paid by the carrier for the
service. Historically, a significant number of submitted claims data have not
included Paid amounts (observed in 5-15% of the claims in a representative data
set). As a result, in past analyses, studies involving costs have relied upon the
Charged amount rather than Paid.

In many respects, the use of the Charged amount is less than optimal.
Many pharmaceutical companies and healthcare organizations analyze cost 
based upon actual expenditures rather than an arbitrary Charged amount. 

Paid amounts have typically not been provided in healthcare claims for a
number of reasons, including: (1) in capitated reimbursement models, providers
receive reimbursement on a per member per month (pmpm) basis, and there is
no need to provide payment information for each procedure; (2) there are specific
contractual arrangements between the provider and healthcare organization, and
such arrangements may vary widely from one organization to the next; and (3)
within an organization, arrangements may vary based on product offering or
geographical location. Additionally, managed care medical and pharmaceutical
claims are inherently problematic due to the variety of billing systems and
processes employed.

Summary of the Invention 

A system and method according to an embodiment of the present 
invention populate data sets with imputed charged and paid amounts. This 
system and method allow for more comprehensive and applicable analyses of 
healthcare expenditures. 

In a preferred embodiment, two new fields are added to the production 
database, called 'pmcharge' and 'pmpaid'. If the charged or paid fields in a data 
set have invalid data (e.g., a value less than or equal to zero), the amount is 
imputed and entered into the appropriate pm field. On the other hand, if the 
submitted data have valid charged or paid values, those amounts are used. 

This method can be used to impute a paid amount in the absence of valid
paid data but in the presence of valid charged data, or vice versa. The imputation
method includes determining a quotient to apply to the valid value (charged or 
paid). The quotient is specific to each data set as well as to each ETG record 
type (Management, Ancillary, Pharmacy, Facility, and Surgery). This method 
ensures a high degree of validity. 

Healthcare claims data can be more accurately and completely analyzed 
with the values included. Other features will become apparent from the following 
detailed description and claims. 

Detailed Description 

In an embodiment of the present invention, a system processes 
healthcare claims data according to a method that includes the following 
processes: 

a) In each data source, estimate the percentage of records with invalid Paid
values: (1) missing Paid values, (2) Paid values of 0, and (3) Paid values
less than 0. If these invalid Paid values make up less than 30% of the
records, the data set continues to be processed. If they make up more
than 30%, the data set is combined with other similar data sets (from the
same region) and processing continues.

b) Create a "learning sub-sample", where only those observations with non- 
zero values of Paid and Charge>=Paid are included. 

c) Estimate a coefficient of correlation for each data source. Check if the 
coefficient is less than 0.6. If the coefficient is less than 0.6, investigate 
for possible contamination or extreme outliers. 

d) Estimate the slope of a regression line forced through the origin (zero
intercept). Check the quality of fit (is the value of R^2 less than 0.5?).

e) Create a variable, Rate = Paid/Charge, where values are more than 0 but
less than 1 on the "learning sub-sample". Records containing values of <=0
are ignored, as estimation cannot be performed for them.

f) Estimate mean and median values for distribution of the Rate-variable for 
each data source and each type of claim separately and for the combined 
sample (the whole abstract). 

g) Estimate the slope of the regression line, e.g., using Iteratively Re- 
weighted Least Squares (IRLS) estimates with the median value of Rate 
as the initial value. 

h) Create a variable "pmpaid" (estimated Paid amount) using the estimated
median of Rate (the variable created in step e, with its median estimated
in step f), multiplied by Charge (separately by each data source and each
type of claim) for non-negative values of Charge.

pmpaid = Charge * Median(Paid/Charge)
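
By way of illustration only, steps (a), (b), (e), (f), and (h) might be sketched in Python as follows. The record layout (dictionaries with the keys "source", "rectype", "charge", and "paid") and all function names are assumptions made for this sketch; they are not part of the described embodiment.

```python
from collections import defaultdict
from statistics import median


def invalid_paid_share(claims):
    """Step (a): fraction of records whose Paid is missing, zero, or negative."""
    bad = sum(1 for c in claims if c["paid"] is None or c["paid"] <= 0)
    return bad / len(claims)


def median_rates(claims):
    """Steps (b), (e), (f): median of Rate = Paid/Charge per (source, rectype)."""
    rates = defaultdict(list)
    for c in claims:
        paid, charge = c["paid"], c["charge"]
        # Learning sub-sample: valid Paid with Charge >= Paid (so Charge > 0 too).
        if paid is None or charge is None or paid <= 0 or charge < paid:
            continue
        rate = paid / charge
        if 0 < rate < 1:  # step (e) keeps rates strictly between 0 and 1
            rates[(c["source"], c["rectype"])].append(rate)
    return {key: median(values) for key, values in rates.items()}


def impute_pmpaid(claim, rates):
    """Step (h): pmpaid = Charge * Median(Paid/Charge) for the claim's group."""
    return claim["charge"] * rates[(claim["source"], claim["rectype"])]
```

Keying the medians by (source, rectype) reflects the statement that the estimate is made separately for each data source and each type of claim. A group with no usable learning records has no entry in the returned dictionary, and how to handle that case is left open here.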

The same methodology can be implemented in the reverse order in the
event there are valid values of the Paid variable corresponding to zero or
negative values of the Charge variable. The advantage of using the median of Rate
is that in this case, one can estimate the unknown value of Charge using the
same "learning sub-sample" and the same coefficient Median(Paid/Charge),
creating a new variable,

pmcharge = Paid / Median(Paid/Charge).

Rules for Estimating Charge and Paid

If Charge >= Paid > 0, then

pmpaid = Paid,
pmcharge = Charge

If Charge and Paid are both invalid (0 or less), then

pmpaid = 0,
pmcharge = 0

If Paid <= 0 and Charge > 0, then

pmpaid = Charge * Median(Paid/Charge),
pmcharge = Charge

If Paid > 0 and Charge <= 0, then

pmpaid = Paid,
pmcharge = Paid / Median(Paid/Charge)

If Paid > 0 and Charge > 0, but Paid > Charge, then

pmpaid = Paid,
pmcharge = Paid / Median(Paid/Charge).
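
The five rules above can be expressed as a single function, sketched below in Python. The function name, the argument order, and the treatment of a missing value (None) as invalid are assumptions of this sketch; k stands for the coefficient Median(Paid/Charge) already estimated for the relevant data source and record type.

```python
def impute_amounts(charge, paid, k):
    """Return (pmcharge, pmpaid) for one claim, where k = Median(Paid/Charge)."""
    charge_ok = charge is not None and charge > 0
    paid_ok = paid is not None and paid > 0

    if not charge_ok and not paid_ok:
        return 0.0, 0.0                 # both invalid
    if charge_ok and not paid_ok:
        return charge, charge * k       # impute Paid from Charge
    if paid_ok and not charge_ok:
        return paid / k, paid           # impute Charge from Paid
    if charge >= paid:
        return charge, paid             # both valid and consistent
    return paid / k, paid               # Paid > Charge: re-estimate Charge
```

For example, with k = 0.5 a claim with Charge = 100 and Paid = 0 yields pmpaid = 50, while a claim with Charge = 30 and Paid = 40 keeps pmpaid = 40 and receives pmcharge = 80.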

Preliminary Statistical Analysis of Data 

Preliminary statistical analysis of data detected a significant difference 
between the empirical distribution and normal distribution for the random 
variables, Charge and Paid. This difference can be explained by several factors: 
(1) only values greater than zero are analyzed; (2) there are a high number of 
outliers; and (3) the data is largely skewed and non-homogenous. The 
consequence is that the use of methods based on an assumption of normal 
distribution can lead to biased or inconsistent results.

The hypothesis of Charge>=Paid was confirmed using the sign test: a
one-sided test showed that the difference between the two variables was
significantly larger than zero.
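
As an illustration of the kind of test described, a one-sided sign test on paired Charge and Paid values might be sketched as follows. The function name and the convention of dropping tied pairs are assumptions of this sketch; the text does not state how ties were handled.

```python
from math import comb


def sign_test_greater(charges, paids):
    """One-sided sign test p-value for the alternative that Charge exceeds Paid."""
    diffs = [c - p for c, p in zip(charges, paids) if c != p]  # ties are dropped
    n = len(diffs)
    positives = sum(1 for d in diffs if d > 0)
    # Under the null hypothesis the number of positive differences is
    # Binomial(n, 1/2); the p-value is P(X >= positives).
    return sum(comb(n, k) for k in range(positives, n + 1)) / 2 ** n
```

A small p-value indicates that positive differences (Charge greater than Paid) occur more often than chance alone would explain. The test uses only the signs of the differences, so it does not rely on an assumption of normality.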

Non-homogeneity of the sample was confirmed by results of the General 
Linear Models procedure, with Duncan multiple range test comparing mean 
values of variables Charge and Paid, classified by categorical variable Rectype 
(type of service claim records). 

As means with the same grouping letter are not significantly different, the 
data demonstrates the variability based on record type. 

It was believed that there was a strong correlation between the Charge 
and Paid variables. Preliminary statistical analysis on 21 different data sources 
showed significantly high correlation coefficients. 
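
The correlation check and the zero-intercept fit referred to in steps (c) and (d) might be sketched as follows, with x standing for Charge and y for Paid. The function names are hypothetical, and the uncentered form of R^2 used here for a line through the origin is an assumption, since the text does not say which definition was applied.

```python
from math import sqrt


def pearson_r(x, y):
    """Sample coefficient of correlation between x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def slope_through_origin(x, y):
    """Least squares slope of y = slope * x with the intercept fixed at zero."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def r2_through_origin(x, y, slope):
    """Uncentered R^2: 1 - (residual sum of squares) / (sum of squared y)."""
    ssr = sum((b - slope * a) ** 2 for a, b in zip(x, y))
    return 1 - ssr / sum(b * b for b in y)
```

A data source whose correlation falls below 0.6, or whose fit falls below an R^2 of 0.5, would then be flagged for the investigation described in steps (c) and (d).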

Ratio Estimate 

A ratio estimate approach is based on the distribution of the ratio of two
random variables, Paid and Charge. This ratio (Rate) is also a random variable,
with values from 0 to 1. The results of a SAS output based on one data source and a
chart of Rates at 0.05 intervals versus numbers of records are provided in the
incorporated provisional application.

To estimate an unknown parameter K for predicting Paid as (K)(Charge),
the sample mean of the variable Rate = Paid/Charge can be used, or a more
robust estimate such as the sample median. Because of the prevalence of
extreme outliers, the latter was employed.
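
The effect of outliers on the two candidate estimates of K can be seen in a small sketch. The Rate values below are invented for illustration and are not taken from any data source; the function name is likewise hypothetical.

```python
from statistics import mean, median


def estimate_k(rates, robust=True):
    """Estimate K from observed Rate values: median if robust, otherwise mean."""
    return median(rates) if robust else mean(rates)


# Five typical rates near 0.6 plus two extreme low outliers (made-up values).
rates = [0.58, 0.60, 0.61, 0.62, 0.63, 0.02, 0.03]
k_median = estimate_k(rates)               # stays at a typical rate
k_mean = estimate_k(rates, robust=False)   # pulled down by the two outliers
```

Here the median remains at one of the typical rates, while the mean is drawn well below every typical rate by the two outliers, which is the behavior that motivates the choice of the median.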

Iteratively Re-weighted Least Squares (IRLS)

Classical methods of regression analysis may not be valid when data does
not follow normal distribution, has significant outliers, or is relatively small in size.
In the case when errors in predictors are large, the use of ordinary least squares
estimates can lead to biased and, sometimes, inconsistent estimates of unknown
parameters. Least squares estimates are only optimal in the case of normal
distribution. For example, for a double-exponential (Laplace) distribution, the
best estimates are derived by minimizing the sum of absolute values of
residuals. In this case, it is more promising to implement so-called "robust
estimates," which use methods that are not sensitive to changes in the
assumptions about the type of distribution, or to the existence of contamination
and outliers in the distribution.

Several different methods of robust estimation were considered other than
IRLS. Robust estimates for the parameter of location can be used instead of the
ordinary sample mean, which is an efficient estimate for normally distributed
random variables. The median, Winsorized mean, and α-trimmed mean are examples
of the most frequently used robust estimates.
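
For reference, the α-trimmed mean and the Winsorized mean might be computed as in the sketch below. The function names are hypothetical, and the convention of cutting int(alpha * n) observations from each tail is one common choice assumed here; alpha is taken to satisfy 0 <= alpha < 0.5.

```python
def trimmed_mean(values, alpha):
    """Mean after discarding the int(alpha * n) smallest and largest values."""
    xs = sorted(values)
    g = int(alpha * len(xs))
    kept = xs[g:len(xs) - g]
    return sum(kept) / len(kept)


def winsorized_mean(values, alpha):
    """Mean after replacing each tail of int(alpha * n) values by its neighbor."""
    xs = sorted(values)
    g = int(alpha * len(xs))
    if g > 0:
        xs[:g] = [xs[g]] * g          # pull the low tail up
        xs[-g:] = [xs[-g - 1]] * g    # pull the high tail down
    return sum(xs) / len(xs)
```

Both estimates limit the influence of the extreme observations in each tail: the trimmed mean discards them, and the Winsorized mean replaces them with the nearest retained value.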

Robust estimates for the parameter of regression can be used instead of
ordinary least squares estimates (which minimize the sum of squares of residuals
from the regression line). Examples include estimates of the least sum of
absolute values of residuals, M-estimates (proposed by Huber, which replace the
squared residuals by another function), and estimates of the least median of
squares (LMS) of residuals.

A useful property of LMS estimates is that they are equivariant with respect to
linear transformations on the explanatory variables, because LMS uses
residuals. The main disadvantage of LMS estimates is their slow rate of
convergence. LMS estimates tend to perform poorly from the point of view of
asymptotic efficiency (bad performance on small sample sizes). So for acceptable
results using this method, large sample sizes are necessary. To improve this
situation, LTS-estimates (least trimmed squares) were proposed. Compared to
ordinary least squares, the only difference is that the largest squared residuals
are not used in the summation, thereby minimizing the effect of large outliers on
the best-fit line.

IRLS estimates are weighted least squares estimates in which the weights are
derived from the residuals (how far outlying the observations are). The weights
dampen the effect of outliers and are revised with each iteration until a robust
fit is obtained. Different weight functions correspond to different IRLS
procedures, and the choice of a proper weight function can be made more
reliably if a priori information regarding the parametric type of distribution
exists.
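
A minimal sketch of an IRLS fit for the slope of a line through the origin is given below, starting from the median value of Rate as in step (g). The Huber weight function, the tuning constant c = 1.345, and the scale estimate based on the median absolute residual are assumptions chosen for illustration; the text does not specify which weight function was used.

```python
from statistics import median


def irls_slope(x, y, c=1.345, tol=1e-9, max_iter=100):
    """Robust slope of y = slope * x via iteratively re-weighted least squares."""
    slope = median(b / a for a, b in zip(x, y))  # initial value: median Rate
    for _ in range(max_iter):
        resid = [b - slope * a for a, b in zip(x, y)]
        # Robust scale from the median absolute residual (guarded against zero).
        scale = max(median(abs(r) for r in resid) / 0.6745, 1e-12)
        # Huber weights: 1 for small residuals, shrinking for large ones.
        weights = [1.0 if abs(r) <= c * scale else c * scale / abs(r)
                   for r in resid]
        num = sum(w * a * b for w, a, b in zip(weights, x, y))
        den = sum(w * a * a for w, a in zip(weights, x))
        new_slope = num / den
        if abs(new_slope - slope) < tol:
            return new_slope
        slope = new_slope
    return slope
```

Each pass solves a weighted least squares problem and then recomputes the weights from the new residuals, so an observation lying far from the current line contributes progressively less to the fit. All x values are assumed to be non-zero.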

While the robust regression method was slightly more accurate than the ratio
estimate in most cases, it can be resource intensive in terms of processing
time. The similar results of the ratio estimate and the robust regression method
provide confidence that the ratio estimate is statistically sound. Also, because
ratio estimates were far simpler to perform and faster in terms of processing
time, the ratio estimate was chosen as the preferred method for imputing unknown
Charge or Paid values.

Variability by Record Type

The coefficient varies not only from one data set to another, but also by
type of record. Record types are denoted as F-Facility, P-Pharmacy, A-Ancillary,
S-Surgery, M-Management. Exact values of the slopes for different data sets
and different types of records are shown in the table and chart in the
incorporated provisional application.

The most consistent slope between the data sets is in Pharmacy claims, 
but the wide variance amongst the data sets by record type supports the 
assumption that imputation should be performed by record type. 

The methods of the present invention can be implemented with a
conventional computer or group of computers operatively connected to a storage
system, such as a conventional database. The data determined according
to the methods are useful for providing the pharmaceutical industry with data
relating to actual costs of procedures.

Having described an embodiment, it should be apparent that modifications 
can be made without departing from the scope of the invention as defined by the 
appended claims. 